Toward invariant functional representations of variable surface fundamental frequency contours: Synthesizing speech melody via model-based stochastic learning
Abstract
Variability has been one of the major challenges for both the theoretical understanding and the computer synthesis of speech prosody. In this paper we show that economical representation of variability is the key to effective modeling of prosody. Specifically, we report the development of PENTAtrainer, a trainable yet deterministic prosody synthesizer based on an articulatory-functional view of speech. Testing on Thai, Mandarin and English shows that high-accuracy predictive synthesis of fundamental frequency contours can be achieved with very small sets of parameters obtained through stochastic learning from real speech data. The first key component of the system is syllable-synchronized sequential target approximation, implemented as the qTA model, which is designed to simulate, for each tonal unit, a wide range of contextual variability with a single invariant target. The second key component is the automatic learning of function-specific targets through stochastic global optimization, guided by a layered pseudo-hierarchical functional annotation scheme that requires manual labeling of only the temporal domains of the functional units. The synthesis accuracy results demonstrate that effective modeling of contextual variability is also the key to effective modeling of function-related variability. Additionally, we show that, being both theory-based and trainable (hence data-driven), computational systems like PENTAtrainer can serve as an effective modeling tool in basic research, raising the level of falsifiability in theory testing and fostering a closer link between basic and applied research in speech science.
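The abstract does not give the qTA equations, so the following is only a minimal sketch of syllable-synchronized sequential target approximation. It assumes the commonly published qTA form, in which each syllable has a linear target x(t) = m·t + b and the F0 contour approaches it as a third-order critically damped response, f0(t) = x(t) + (c1 + c2·t + c3·t²)·e^(−λt), with the final F0 state of one syllable passed on as the initial state of the next. The function names `qta_syllable` and `synthesize`, the sampling step, and all parameter values are hypothetical and chosen for illustration; they are not taken from PENTAtrainer.

```python
import math


def qta_syllable(m, b, lam, state0, duration, dt=0.005):
    """Approach one linear target x(t) = m*t + b over a single syllable.

    state0 is (f0, velocity, acceleration) at syllable onset.
    Returns (samples, final_state); samples exclude the syllable end point.
    """
    y0, v0, a0 = state0
    # Transient coefficients chosen so the contour matches state0 at t = 0.
    c1 = y0 - b
    c2 = v0 + c1 * lam - m
    c3 = (a0 + 2.0 * c2 * lam - c1 * lam ** 2) / 2.0

    def state(t):
        p = c1 + c2 * t + c3 * t * t      # transient polynomial
        dp = c2 + 2.0 * c3 * t            # its first derivative
        ddp = 2.0 * c3                    # its second derivative
        decay = math.exp(-lam * t)
        f0 = m * t + b + p * decay
        vel = m + (dp - lam * p) * decay
        acc = (ddp - 2.0 * lam * dp + lam ** 2 * p) * decay
        return f0, vel, acc

    n = round(duration / dt)
    samples = [state(i * dt)[0] for i in range(n)]
    return samples, state(duration)


def synthesize(targets, durations, f0_start):
    """Chain syllables: each one starts from the previous final state."""
    state = (f0_start, 0.0, 0.0)
    contour = []
    for (m, b, lam), dur in zip(targets, durations):
        samples, state = qta_syllable(m, b, lam, state, dur)
        contour.extend(samples)
    return contour


# Two static targets (slope 0) at heights 100 and 120, illustrative units.
contour = synthesize([(0.0, 100.0, 40.0), (0.0, 120.0, 40.0)],
                     [0.2, 0.2], f0_start=110.0)
```

In this sketch the same target (m, b, λ) yields different surface contours depending on the state inherited from the preceding syllable, which is the sense in which a single invariant target can account for contextual variability.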
Highlights
• High synthetic accuracy of prosody achieved for Thai, Mandarin and English
• Many-to-one mapping from contextually variable surface F0 to invariant functional targets
• Effective handling of both contextual and non-contextual variability
• Combination of deterministic synthesis and data-driven parameter learning
• Large-scale, fully detailed prosody synthesis as a tool for theory testing
• Freely available as Praat scripts and plug-ins to the speech science community at large
Related articles
Uncertainty in fundamental natural frequency estimation for alluvial deposits
Seismic waves are filtered as they pass through soil layers, from bedrock to surface. Frequencies and amplitudes of the response wave are affected due to this filtration effect and this will result in different ground motion characteristics. Therefore, it is important to consider the impact of the soil properties on the evaluation of earthquake ground motions for the design of structures. Soil ...
Synthesizing intonation of standard Arabic language
In this paper, we propose a model to generate fundamental frequency (F0) contours using neural networks. A learning procedure is proposed as an alternative to synthesis-by-rules. The generation of correct fundamental frequency contour is one of the important issues in the naturalness of automatic text-to-speech conversion systems. The proposed approach is based on a standard feed-forward multi-...
Corpus-Based Hidden Markov Modelling of the Fundamental Frequency of Lithuanian
This paper presents the corpus-driven approach in building the computational model of fundamental frequency, or F0, for Lithuanian language. The model was obtained by training the HMM-based speech synthesis system HTS on six hours of speech coming from multiple speakers. Several gender specific models, using different parameters and different contextual factors, were investigated. The models we...
Newborns' Cry Melody Is Shaped by Their Native Language
Human fetuses are able to memorize auditory stimuli from the external world by the last trimester of pregnancy, with a particular sensitivity to melody contour in both music and language. Newborns prefer their mother's voice over other voices and perceive the emotional content of messages conveyed via intonation contours in maternal speech ("motherese"). Their perceptual preference for the surr...
PENTATrainer2: A hypothesis-driven prosody modeling tool
Prosody is an essential aspect of speech, as it carries both lexical and non-lexical information. A conventional approach to studying speech prosody is to collect and analyze F0 data based on certain hypotheses and then develop a theory based on the observation, which constitutes the final conclusion of the study. This process is however far from complete, as the developed theory has not been a...
Journal title:
- Speech Communication
Volume: 57
Publication date: 2014